Kernel: R

COO - single cell RNA sequencing data analysis

Bioinformatics in Neuroscience course, 2022

Single cell data analysis of the adult human brain

About this tutorial

This is the single cell RNA sequencing (scRNAseq) data analysis tutorial, prepared for the scRNAseq days of the Bioinformatics in Neuroscience course, organised for the Neuroscience and Cognition master and other Utrecht University master programs. The script follows the well known vignette of the Seurat (v3) pipeline for pbmc, adapted to analyse the scRNAseq data of the adult human brain.

The data is a subset of the recent work from Siletti et al (Bioarxiv, 2022; https://www.biorxiv.org/content/10.1101/2022.10.12.511898v1), which you received for this course. In this study, ~4M nuclei from different parts of the human brain is sequenced. You will later here a talk by the first author of the study, which will be a natural continuation of this practice. A subset of the original atlas is provided for the purpose of this practical in the file "../rawdata/A46_10X176_8_count_data.rds" . An accompanying metadata file, "../rawdata/A46_10X176_8_metadata.rds", will give you additional information. The rawdata as well as analysis of the entire dataset can be reached at: https://github.com/linnarsson-lab/adult-human-brain

Participants will then learn how to use the ‘Seurat’ pipeline, the most popular ‘R’ package to process and analyze scRNAseq data. The instructor will guide the students step-by-step, where a prewritten code will take through the first steps of scRNAseq data analysis. These include shaping the raw data, preprocessing, dimensionality reduction, basic clustering and differential gene expression analysis.

Learning goals

1. Gain the capacity to run a standard scRNAseq data analysis pipeline independently

2. Have the ability to interpret the results of single cell data analysis

The course used the CoCalc server: https://cocalc7.science.uu.nl/

During the tutorial, keep in mind

1. What are the major considerations during the preprocessing steps?

2. How does dimensionality reduction help us understand the data?

3. What is a molecular signature? How can marker genes help us undertand cellular (and brain) function?

Part I: Upload the data and create a Seurat object

Some good practice
To keep everything ordered, make some folders for saving the data and the figures. You will receive a warning if you have already made these folders

In [1]:
dir.create("write/") dir.create("figures/")
Out[1]:
Warning message in dir.create("write/"): “'write' already exists” Warning message in dir.create("figures/"): “'figures' already exists”

1.1 Load libraries

Load the necessary libraries to run our pipeline based on the well-known Seurat pipeline. You can find detailed information on Seurat as well as tutorials on its website (https://satijalab.org/seurat/) or on Github (https://github.com/satijalab/seurat)

We are going to use Seurat v3, which is well established. Note that there is a recent beta release (v4) with additional functionalities

In [155]:
# Load the libraries library(Seurat,verbose = FALSE) library(ggplot2,verbose = FALSE) library(RColorBrewer,verbose = FALSE) library(useful,verbose = FALSE) library(readr,verbose = FALSE) library(hdf5r,verbose = FALSE) # check if the following are on library(dplyr,verbose = FALSE) library(stringr,verbose = FALSE)

1.2 Load the data

We will use the 10x Genomics data generated from the adult human brain. Since 4M cells are too many, we are using a ~5k subset. The data is stored in two seperate files

In [156]:
# This contains the actual data as a cell x gene matrix. Each cell in this matrix contain information on how many reads we have for a given gene in a given cell count_data <- read_rds("../rawdata/A46_10X176_8_count_data.rds")
In [157]:
# the dimensions for a sanity check. The first dimension shows the number of features (in this case, genes) while the second indicates teh number of cells dim(count_data)
Out[157]:
  1. 36869
  2. 4734
In [158]:
# This contains the metadata. Each column contains information for each cell in teh dataset. metadata <- read_rds('../rawdata/A46_10X176_8_metadata.rds')
In [159]:
# what do we have in the metadata? colnames(metadata)
Out[159]:
  1. 'orig.ident'
  2. 'nCount_RNA'
  3. 'nFeature_RNA'
  4. 'Age'
  5. 'Class_auto.annotation'
  6. 'Clusters'
  7. 'Donor'
  8. 'Neuropeptide_auto.annotation'
  9. 'Neurotransmitter_auto.annotation'
  10. 'Roi'
  11. 'SampleID'
  12. 'Subtype_auto.annotation'
  13. 'Supercluster'
  14. 'Tissue'
  15. 'n_genes'
  16. 'n_counts'

1.3 Create Seurat object

Here, we will create a "seurat object" which has a different structure than a data table. Seurat object contains the data as a matrix in a separate slot, but can also have 'features' that are linked to genes or cells. For instance cell types for each cell can be stored in a separate slot. In addition, we can store additional information e.g. results of differential expression test or coordinates for each cell after a given dimensionality reduction (e.g. pca)

'min.cells' is to filter genes: only genes detected in min of these number of cells will be kept 'min.features' is to filter cells: only cells that have a minimum total of given number of features will be kept Thus, these two provide initial quality control of the data

In [160]:
dataset <- CreateSeuratObject(counts = count_data, min.cells = 1, min.features = 1000, meta.data = metadata, project = "Human_brain")

visualise the number of genes and features, which are already provided in the dataset

In [161]:
head(dataset@meta.data)
Out[161]:
A data.frame: 6 × 16
orig.identnCount_RNAnFeature_RNAAgeClass_auto.annotationClustersDonorNeuropeptide_auto.annotationNeurotransmitter_auto.annotationRoiSampleIDSubtype_auto.annotationSuperclusterTissuen_genesn_counts
<fct><dbl><int><dbl><chr><dbl><chr><chr><chr><chr><chr><chr><chr><chr><dbl><dbl>
10X176_8:TCGTCCATCGCCTAGG10X1762794172950NK 2H18.30.002nannanHuman A4610X176_8nanMiscellaneousCerebral cortex (Cx) - Middle frontal gyrus (MFG) - A4617292794
10X176_8:TCACACCGTGTCTTAG10X1762192136750MGL8H18.30.002IGFnanHuman A4610X176_8nanMicroglia Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A4613672192
10X176_8:TCGACGGGTTGCATTG10X1763368201050MGL8H18.30.002IGFnanHuman A4610X176_8nanMicroglia Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A4620103368
10X176_8:GAACACTAGTATGGCG10X1763871211650MGL8H18.30.002IGFnanHuman A4610X176_8nanMicroglia Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A4621163871
10X176_8:GTGCAGCTCGACCAAT10X1763729212450MGL8H18.30.002IGFnanHuman A4610X176_8nanMicroglia Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A4621243729
10X176_8:GACAGCCAGTGCACCC10X1763288189850MGL8H18.30.002IGFnanHuman A4610X176_8nanMicroglia Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A4618983288

The first visual quality control is to look at the number of cells (nCount_RNA) and number of genes (nFeature_RNA) per cells. These features are calculated while generating the Seurat object. For instance, the number of RNA molecules per cell is stored as 'nCount_RNA', which is located in dataset@meta.data$nCount_RNA

1.4 Visualise basic metrics

In [163]:
plot1 <- VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA"),pt.size = .01) plot2 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") plot1 | plot2
Out[163]:
Image in a Jupyter notebook
Question 1.1

What do you see? :

1. How do the number of genes and RNA molecules correlate?

2. There are two populations visible on the violin plot... What do these mean?

Answer whenever you want!

Back to the presentation

Part II: Preprocessing

Quality control

2.1 Standard pre-processing workflow

set a name that will be used later for the files you save. Anything goes. You can add your own name

In [164]:
name = "my_analysis_name"

2.1.1 Calculate additional QC metrics

General information on the use of mitochondria and ribosomal proteins for QC
Mitochondrial RNA is the mRNA that is generated by the mitochondrial genome. Normally, these constitute a small fraction of the total mRNA. However, in dying or damaged cells, while cytoplasmic/nuclear mRNA degrades rapidly, mitochondrial mRNA is rather well preserved. Thus, a high ratio of mitochondrial mRNA indicates BAD cell quality.

mRNA coding for the Ribosomal subunit proteins is abundant (not to be confused with rRNA). Usually, a high ribosomal RNA percentage indicates production of a lot of proteins, and is very high in dividing cells or some secretory cells that need to constantly produce proteins. However, if most of the mRNA (>30-50%) that we detect is ribosomal, it means that we wont learn much from this cell, and that we should exclude it from the analysis.

Question 2.1

Note that we are analysing single nuclei data. We have the following question:

1. How do you think this will effect our quality control?

2. In this case, what does high ribosomal/mitochondrial percentage mean?

Spend 1 min to think about the questions. Then, talk to the person next to you about what you think. Finally, summarise it in one sentence

In [165]:
# Start by generating QC metrics additional to the no of genes/features dataset <- PercentageFeatureSet(dataset,pattern='^MT-', col.name = "percent.mt") #Select all features starting with MT-, which are the Mitochondrial genes dataset <- PercentageFeatureSet(dataset,pattern='RP(S|L)', col.name = "percent.ribo") # Select all that starts with RPS or RPL

Plot and inspect. You can remove the ots in case they cause too much moise

2.1.2 Visualise QC metrics

In [183]:
options(repr.plot.width=12, repr.plot.height=6) VlnPlot(object = dataset, features = c("percent.mt", "percent.ribo"))
Out[183]:
Image in a Jupyter notebook

You can save them all as a pdf file to use in presentations, reports etc

In [167]:
pdf(paste0(name,"_Vlnplot_stats_all.pdf"),height = 6,width = 12) VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA","percent.mt", "percent.ribo")) # If there are too many points for your liking, you can turn them off as follows: #VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA","percent.mt", "percent.ribo"),pt.size = 0, cols = "blue") dev.off()
Out[167]:
png: 2

Visualise multiple metrics for each cell as a measure of the number of reads (in other words; UMI, or unique RNA molecules). This helps use to evaluate the quality of the data in multiple ways

In [174]:
plot1 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "percent.mt") plot2 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "percent.ribo") plot1 | plot2
Out[174]:
Image in a Jupyter notebook

2.2 Filter

Decide on the filtering parameters that you want to use. This is not always objective! However, you can see the 'outliers' in the data rather well

In [175]:
#Now filter dataset <- subset(x = dataset, subset = percent.mt < 5 & nCount_RNA < 90000 & percent.ribo < 2)
In [176]:
dataset
Out[176]:
An object of class Seurat 36868 features across 4700 samples within 1 assay Active assay: RNA (36868 features, 0 variable features)

Now get rid of the mitochondrial genes which we do not need for downstream analysis.

In [177]:
dataset <- subset(x = dataset[-grep(pattern = "^MT-", x = rownames(dataset), value = FALSE)]) dataset
Out[177]:
An object of class Seurat 36842 features across 4700 samples within 1 assay Active assay: RNA (36842 features, 0 variable features)
In [0]:

Back to the presentation

Part III: Normalize and find variable features

Quality control

3.1 Normalisation

Standard flow of Seurat is replaced by a single command. However, it is good to see this part to learn each step

In [178]:
## normalise the data dataset <- NormalizeData(object = dataset, normalization.method = "LogNormalize", scale.factor = 10000)

The log transformed RNA counts from each cell is now normalised using a size factor. In other words, the sum of counts for each cells is more or less the same. We can now compare different cells to each other

3.2 Detection of variable genes across the single cells

In [179]:
## Here, we select the top 1,000 highly variable genes (Hvg) for downstream analysis. dataset <- FindVariableFeatures(object = dataset, selection.method = 'mean.var.plot', mean.cutoff = c(0.0125, 3), dispersion.cutoff = c(0.5, Inf)) length(x = VariableFeatures(object = dataset)) #3084 ## Identify the 10 most highly variable genes top10 <- head(VariableFeatures(dataset), 10)
Out[179]:
5676
In [191]:
## Plot # plot variable features with and without labels plot1 <- VariableFeaturePlot(dataset) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) #plot1 plot2
Out[191]:
When using repel, set xnudge and ynudge to 0 for optimal results Warning message: “ggrepel: 2 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Image in a Jupyter notebook
In [194]:
# Generate a vector with the top 1000 hvgens hv.genes <- head(rownames(HVFInfo(object = dataset)), 1000)

Scale the data

Scale the values for each features. The result is in units of 'standard deviation' where the mean expression is zero. If a gene is expressed 1 standard deviation more than the mean across all cells, it will have a value of 1. During this process, we can also 'regress out' technical effects. For instance, the number of features and the mitochondrial percentage.

In [197]:
dataset <- ScaleData(object = dataset, features = hv.genes, vars.to.regress = c("nFeature_RNA") )
Out[197]:
Regressing out nFeature_RNA Centering and scaling data matrix

Back to the presentation

Part IV: Dimensionality reduction techniques

4.1 Perform dimensionality reduction

You can run PCA, t-SNE and UMAP in simple one-line codes... The parameters are NOT random, but we will ignore this for the sake of time

In [198]:
# Dimention reduction dataset <- RunPCA(object = dataset, features = hv.genes, verbose = FALSE,npcs = 50)
In [199]:
dataset <- RunUMAP(object = dataset, reduction = "pca", dims = 1:50, verbose = FALSE) dataset <- RunTSNE(object = dataset, dims = 1:50, verbose = FALSE)

4.2 Examine and visualize PCA results a few different ways

4.2.1 Dimplots

First, let's have a look at the first three dimensions. For this, we will take advantage of some of hte metadata generated by the authors: Supercluster information

In [205]:
options(repr.plot.width=12, repr.plot.height=6) DimPlot(dataset,cols = 'Spectral',group.by = 'Supercluster',pt.size = 1,reduction = 'pca',dims = c(1,2)) DimPlot(dataset,cols = 'Spectral',group.by = 'Supercluster',pt.size = 1,reduction = 'pca',dims = c(3,4))
Out[205]:
Warning message in RColorBrewer::brewer.pal(n, pal): “n too large, allowed maximum for palette Spectral is 11 Returning the palette you asked for with that many colors ” Warning message in RColorBrewer::brewer.pal(n, pal): “n too large, allowed maximum for palette Spectral is 11 Returning the palette you asked for with that many colors ”
Image in a Jupyter notebookImage in a Jupyter notebook
Question 4.1

1. What differences do you see between different principle components?

2. Some genes contribute more to the variation then others... What do these genes and principle components tell us?

Discuss with your neighbours, in groups of 3, for 2 min. We will go through the groups

4.2.2 List of genes that influence PCs that most in both (+/-) directions

You see the top genes that contribute to the respective principle components

In [206]:
print(dataset[["pca"]], dims = 1:5, nfeatures = 5)
Out[206]:
PC_ 1 Positive: PLP1, ST18, ENPP2, MOG, LINC01608 Negative: OBI1-AS1, AL137139.2, NHSL1, FGFR3, ATP1A2 PC_ 2 Positive: OBI1-AS1, SLC1A3, AL137139.2, FGFR3, ATP1A2 Negative: BTBD11, DLX1, DOCK11, ADARB2, SYTL5 PC_ 3 Positive: DOCK8, APBB1IP, P2RY12, FYB1, SLCO2B1 Negative: ADARB2, PROX1, EGFR, AL391832.4, CXCL14 PC_ 4 Positive: DOCK8, APBB1IP, P2RY12, FYB1, TBXAS1 Negative: FLT3, LHX6, AC022905.1, MYO5B, PAWR PC_ 5 Positive: FERMT1, AC004852.2, VCAN, PDGFRA, STK32A Negative: LPAR1, BTBD11, DAAM2, LHX6, SYTL5

4.2.3 Visualise genes that influence PCs the most

In [207]:
VizDimLoadings(dataset, dims = 1:2, reduction = "pca")
Out[207]:
Image in a Jupyter notebook

4.2.3 Heatmaps to visualise genes that influence PCs

In [212]:
options(repr.plot.width=12, repr.plot.height=9) DimHeatmap(dataset, dims = 1:4, cells = 500, balanced = TRUE, ncol = 2)
Out[212]:
Image in a Jupyter notebook

4.3 Determine statistically significant principal components

How many principle components do we need? There are some practical guidelines that will help with your analysis

4.3.1 JackStrawPlot()

We can do this with JackStrawPlot() function of Seurat. This is powerful, but time consuming. Running it altogether slows the server considerably. Also, we cannot visualise all different PCs (only up to 25). We will skip it for now (you can run it during the practice if you want to) and instead look at PCs using heatmaps

dataset <- JackStraw(object = dataset, num.replicate = 100) dataset <- ScoreJackStraw(object = dataset, dims = 1:20) JackStrawPlot(object = dataset, dims = 1:20)

4.3.2 Heatmaps

We will plot the top genes from selected PCs and see how they differe between cells.

In [221]:
options(repr.plot.width=18, repr.plot.height=9) DimHeatmap(dataset, dims = c(1,2,5,10, 15,20,25,30, 35,40,45,50), cells = 250, balanced = TRUE, ncol = 4)
Out[221]:
Image in a Jupyter notebook
Question 4.2

What do you see?

1. When do you think the variation is meaningful, and when not?

2. Some genes contribute more to the variation then others... What do these genes and principle components tell us?

3. What does it mean in terms of 'Superclusters'?

4.How would our results change if we include less or more principle components for downstream analysis?

Think for yourself for 2 min, then talk to your neighbors, in groups of 3, for 2 min. We will go through the groups

4.2.2 Elbowplot as an alternative

In [211]:
# Plots the standard deviations (or approximate singular values if running PCAFast) of the principle components for easy identification of an elbow in the graph. This elbow often corresponds well with the significant dims and speeds up running than Jackstraw # In this example, it looks like the elbow would fall around PC 11, then 25 and 35 again. After this, there is little change. options(repr.plot.width=12, repr.plot.height=6) ElbowPlot(object = dataset, reduction = "pca",ndims = 50)
Out[211]:
Image in a Jupyter notebook
Question 4.3

1. What does it mean in terms of 'Superclusters'?

2. When do you think the variation in a PC is meaningful, and when not?

3.How would our results change if we include less or more principle components for downstream analysis?

Think for yourself for 2 min, then talk to your neighbors, in groups of 3, for 2 min. We will go through the groups

Back to the presentation

Part V: Clustering

5.1 Find clusters

here, we will use graph-based clustering to identify groups of cells that are similar, which we will call 'clusters'

5.1.1 Use embeded 'louvain' algorithm for clustering

In [222]:
#SNN Graph Construction dataset <- FindNeighbors(object = dataset, dims = 1:35) #cell the pcs based on Jackstraw and Elbow plots #Find clusters using the louvain algorithm dataset <- FindClusters(object = dataset, resolution = 1) # changing the resolution will change the size/number of clusters! c(0.6, 0.8, 1)
Out[222]:
Computing nearest neighbor graph Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck Number of nodes: 4700 Number of edges: 177262 Running Louvain algorithm... Maximum modularity in 10 random starts: 0.8930 Number of communities: 23 Elapsed time: 0 seconds

5.1.2 Visualize QCs

Let's have a look at the QC stats agian. As default, this time the information will be displayed per cluster

In [231]:
# Statistics options(repr.plot.width=12, repr.plot.height=6) VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA"),pt.size = 0)
Out[231]:
Image in a Jupyter notebook

5.2 Cell type identification

5.2.1 Visualise lineage specific genes curated from the literature

We provide some markers from the literature as a reference for identifying cell types. You can find more of these in Siletti et al 2022 Bioarxiv manuscript

ELAVL2, TUBB3: Neurons PLP1: Oligodendrocytes GFAP: Astrocytes ITGAM: Microglia PECAM1: Endothelial cells

5.2.1.1 t-SNE plots for visualising markers at single cell level
In [240]:
options(repr.plot.width=18, repr.plot.height=12) ColorRamp <- colorRampPalette(rev(brewer.pal(n = 7,name = "RdYlBu")))(100) # This is how to generate a color key FeaturePlot(object = dataset, features = c("ELAVL2","TUBB3","PLP1","GFAP","ITGAM","PECAM1"),reduction = 'tsne',cols = ColorRamp, ncol =3)
Out[240]:
Image in a Jupyter notebook

Neurons are very diverse! We can use additonal markers to classify them functionally as 'Glutamatergic' and 'GABAergic'

SLC17A5, SLC17A6: Glutamatergic neurons GAD1, GAD2, SLC32A1: GABAergic neurons

In [241]:
options(repr.plot.width=18, repr.plot.height=12) ColorRamp <- colorRampPalette(rev(brewer.pal(n = 7,name = "RdYlBu")))(100) # This is how to generate a color key FeaturePlot(object = dataset, features = c("ELAVL2","GAD1","GAD2","SLC17A5","SLC17A6","SLC32A1"),reduction = 'tsne',cols = ColorRamp, ncol =3)
Out[241]:
Image in a Jupyter notebook
5.2.1.2 Violin plot for cluster level average
In [242]:
options(repr.plot.width=18, repr.plot.height=6) VlnPlot(object = dataset, features = c("ELAVL2","TUBB3","PLP1","GFAP","ITGAM","PECAM1"),pt.size = 0, ncol =2) #, slot = 'counts' # you can plot raw counts as well
Out[242]:
Image in a Jupyter notebook
In [243]:
options(repr.plot.width=18, repr.plot.height=6) VlnPlot(object = dataset, features = c("ELAVL2","GAD1","GAD2","SLC17A5","SLC17A6","SLC32A1"),pt.size = 0, ncol =2) #, slot = 'counts' # you can plot raw counts as well
Out[243]:
Image in a Jupyter notebook

You can use marker gene information to name each cluster as

cell classes e.g. oligodendrocytes, neurons or marker-defined cell types e.g. GlutamatergicNeuron_0

5.2.2 Compare our clusters with the metadata

Your clusters are saved in the metadata, as part of the seurat object, with the name "seurat_clusters". We compare this to the metadata from the authors, which will help you decide which clusters belong to which cell type

In [244]:
plot21 <- DimPlot(object = dataset, reduction = 'tsne',group.by = "seurat_clusters") plot22 <- DimPlot(object = dataset, reduction = 'tsne',group.by = "Supercluster") plot23 <- DimPlot(object = dataset, reduction = 'tsne',group.by = "Clusters") #plot, adjusting the size of the graph options(repr.plot.width=18, repr.plot.height=6) plot21 | plot22 options(repr.plot.width=12, repr.plot.height=6) plot23
Out[244]:
Image in a Jupyter notebookImage in a Jupyter notebook

We can visualise the number of cells labeled with each label using the table() function

In [264]:
table(dataset@meta.data[['Supercluster']])
Out[264]:
Astrocyte CGE interneuron 288 565 Committed oligodendrocyte precursor Deep-layer corticothalamic and 6b 6 127 Deep-layer intratelencephalic Deep-layer near-projecting 599 57 Eccentric medium spiny neuron Fibroblast 1 4 LAMP5-LHX6 and Chandelier MGE interneuron 75 761 Microglia Miscellaneous 48 31 Oligodendrocyte Oligodendrocyte precursor 259 100 Splatter Upper-layer intratelencephalic 2 1774 Vascular 3
Question 5.1

1. Why are there such big differences in the number of clusters between different analysis?

Shoot your answer!

5.2.3 Visualise how different labels compare to each other

To compare them, we will do a trick. First, generate a dataframe with teh information you need, then use the table() function of make a confusion matrix

In [298]:
df <- as.data.frame(cbind(dataset@meta.data[['Supercluster']],dataset@meta.data[['seurat_clusters']])) rownames(df) <- colnames(dataset) colnames(df) <- c('Supercluster','seurat_clusters') #Supercluster Clusters
In [299]:
head(table(df)[,1:10]) #head() automatically looks at teh first 6 rows. We also shorthen the columns to 10
Out[299]:
seurat_clusters Supercluster 1 10 11 12 13 14 15 16 17 18 Astrocyte 0 0 0 0 0 1 0 0 0 60 CGE interneuron 0 0 0 0 153 2 0 0 93 0 Committed oligodendrocyte precursor 0 0 0 0 0 0 0 3 0 0 Deep-layer corticothalamic and 6b 0 0 0 0 0 5 119 0 0 0 Deep-layer intratelencephalic 18 0 0 160 0 26 0 0 0 0 Deep-layer near-projecting 0 0 0 0 0 1 0 0 0 0

Quick hetmap to visualise which cell types belong to which cluster

In [300]:
heatmap(table(df),xlab ='seurat_clusters')
Out[300]:
Image in a Jupyter notebook
Question 5.2

1. What does the heatmap tell you?

Shoot your answer!

Back to the presentation

Part VI: Differential gene expression analysis

An important step in scRNAseq data analysis is to find out the molecular signature (or genes) that discriminate a cluster from others. There are different ways of comparing clusters (e.g. two clusters, or one versus all, or a few clusters versus a few clusters) and different statistical methods to achieve this

6.1 find marker genes for each cluster

6.1.1 Start with a single cluster

We often want to see what makes a cluster special, or different than the rest. You can achieve this here

In [226]:
cluster1.markers <- FindMarkers(object = dataset, ident.1 = 0, min.pct = 0.25) #, ident.2 = c(0, 3) to compare head(x = cluster1.markers, n = 5)
Out[226]:
A data.frame: 5 × 5
p_valavg_log2FCpct.1pct.2p_val_adj
<dbl><dbl><dbl><dbl><dbl>
LAMA201.83741400.9300.2400
LINC0229601.58210470.9090.2200
AC117453.101.11470410.8260.1460
TXK00.48800590.7130.1000
COL5A201.77364320.8710.1650

6.1.2 Find markers for every cluster compared to all remaining cells, report only the positive ones

This is extremely useful to define what these cells are! However, you will see that a 'global comparison' is not always as accurate as you would expect

In [0]:
library(dplyr) dataset.markers <- FindAllMarkers(object = dataset, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) # we have just calculated the marker genes for each cell type

We can select the top genes for each cluster. Here, we will select the top 2 genes

In [304]:
dataset.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC) %>% head(n=10) # Takes a bit of time
Out[304]:
A grouped_df: 10 × 7
p_valavg_log2FCpct.1pct.2p_val_adjclustergene
<dbl><dbl><dbl><dbl><dbl><fct><chr>
0.000000e+002.1746040.9330.192 0.000000e+000AJ009632.2
0.000000e+001.8374140.9300.240 0.000000e+000LAMA2
3.432075e-2732.1720050.9800.3021.264445e-2681LINC02306
1.070594e-2722.0790031.0000.4723.944284e-2681CBLN2
3.836338e-2282.6755270.9430.2521.413384e-2232TSHZ2
7.190384e-1542.4214550.8380.2632.649081e-1492AC109466.1
6.939798e-1912.9317521.0000.6292.556760e-1863ERBB4
9.019916e-1722.4377870.9970.6143.323118e-1673ZNF804A
6.064442e-1792.5070010.8900.2602.234262e-1744AC109466.1
1.731711e-1602.4403030.9850.6286.379970e-1564CASC15

save the data as separate sheets in an excel file

In [228]:
library(openxlsx) ## Create Workbook object and add worksheets wb <- createWorkbook() ## Add the cluster info for ( n in unique(dataset.markers$cluster) ) { addWorksheet(wb, n) writeData(wb, n, dataset.markers[dataset.markers$cluster == n,], startCol = 1, startRow = 1, rowNames = TRUE, colNames = TRUE) } #save saveWorkbook(wb, paste(name,"_cell_clust_diff_genes.xlsx",sep=""), overwrite = TRUE)

6.2 DoHeatmap generates an expression heatmap for given cells and features.

In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.

In [245]:
dataset.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC) -> top10
In [310]:
options(repr.plot.width=12, repr.plot.height=12) #dataset <- ScaleData(object = dataset, features = rownames(top10$gene)) #,vars.to.regress = "percent.mt" DoHeatmap(object = dataset, features = top10$gene[1:100],slot = 'scale.data') + NoLegend()
Out[310]:
Warning message in DoHeatmap(object = dataset, features = top10$gene[1:100], slot = "scale.data"): “The following features were omitted as they were not found in the scale.data slot for the RNA assay: ROBO2, DPP10, ERBB4”
Image in a Jupyter notebook
Question 1

What do you see? :

1. How do the number of genes and RNA molecules correlate?

2. There are two populations visible on the violin plot... What do these mean?

Answer whenever you want!

Back to the presentation

Assignment

Workflow
1. Discuss a workplan, split the work
2. Go through the assignment individually
3. Come together and discuss your part with each others
4. Prepare a very short presentation

You can use Monday afternoon, as well as the practice session on Tuesday morning. The instructor will be available for questions through Slack on Monday, and in person at the practice session on Tuesday morning. We will have the time to wrap-up the summary at the end of the practice

To Do
This section is different for each group

More information: https://satijalab.org/seurat/

https://satijalab.org/seurat/v3.1/visualization_vignette.html

Extra code for the practical

subsetting neuronal clusters '0' and '1'

In [247]:
small <- subset(dataset, idents = c('0','1'))
In [248]:
small
Out[248]:
An object of class Seurat 36842 features across 1222 samples within 1 assay Active assay: RNA (36842 features, 5676 variable features) 3 dimensional reductions calculated: pca, umap, tsne
In [249]:
DimPlot(small, reduction = 'tsne')
Out[249]:
Image in a Jupyter notebook

You can proceed from step 3.2 to repeat the analysis of a subset of hte dataset that contains clusters 1 and 2

THE END